Spatio-Temporal Small Worlds for Decentralized 
Information Retrieval in Social Networking 



Georg Groh
TU Munchen
Faculty for Informatics
grohg@in.tum.de

Florian Straub
ETH Zurich
Inst. of Cartography and Geoinformation
straubf@ethz.ch

Benjamin Koster
TU Munchen
Faculty for Informatics
koster@in.tum.de



ABSTRACT 

We discuss foundations and options for alternative, agent- 
based information retrieval (IR) approaches in Social Net- 
working, especially Decentralized and Mobile Social Net- 
working scenarios. In addition to usual semantic contexts, 
these approaches make use of long-term social and spatio-temporal
contexts in order to satisfy conscious as well as unconscious
information needs according to Human IR heuristics. Using a large
Twitter dataset, we investigate these approaches and, in particular,
the extent to which spatio-temporal contexts can act as a conceptual
bracket implying social and semantic cohesion, giving rise to the
concept of Spatio-Temporal Small Worlds.

Categories and Subject Descriptors 

H.4 [Information Systems Applications]: Miscellaneous

Keywords 

Collaborative (Geographic) Information Retrieval, Spatial
Context, (Geo) Social Networks, Human Search, Small World
Networks, Data Analysis, Information Needs.

1. INTRODUCTION

Social Networking (SN) and Decentralized Social Networking
(DSN) [53], as a future variant of SN, are extensively used
to build rich personal and interpersonal information spaces. 
Furthermore, the increased access of SN-platforms via mo- 
bile devices such as smartphones (giving rise to new paradigms 
such as (context-aware) Mobile Social Networking (MSN)) 
introduces a steeply growing permeation of these informa- 
tion spaces with explicit spatial context. Thus, besides social 
contexts such as 'friendship' relations, spatio-temporal con- 
texts and their interrelations with social contexts are also 
available and extensively used in modern (M)SN platforms. 

These upcoming SN paradigms increasingly allow users
to employ special forms of information retrieval, akin to traditional
human information seeking behavior based on the
real social network of society ('Human IR') which, besides 
semantic context, also uses social and spatio-temporal con- 
text (see also [52]). 

Inspired by this behavior, the question now arises how 
alternative IR services for SN may be constructed that ef- 
fectively make use of social, semantic, and spatio-temporal 
contexts and their interrelations. 

Pursuing this research question, the remainder of this paper
is structured as follows. After a brief discussion of the



relation between context and information needs, we address 
Human IR and wayfinding in social networks. We then in- 
troduce the concept of Spatio-Temporal Small Worlds for 
IR in Social Networking as well as a respective architec- 
ture based on personal information agents. The following 
main part of the paper empirically investigates the concept 
of Spatio-Temporal Small Worlds and the suitability of the 
principles guiding alternative IR processes inspired by Hu- 
man IR, using social search, semantic search and spatio- 
temporal search and here especially the suitability of spatio- 
temporal embedding as a contextual bracket using a large 
Twitter dataset. 

This paper is an extended version of the content of the 
paper [13]. Elements of this text also appear in the thesis 
[14].

2. RELATED WORK AND FUNDAMENTAL 
CONSIDERATIONS 

2.1 Context and Unconscious Information 
Needs 

In [40] adequate characterizations of relevance in infor- 
mation retrieval (IR) and especially qualifications of infor- 
mation needs that a user of IR has in view of a 'problematic 
situation' [7], [8], [40] are investigated. In this regard, the 
concepts query, request, perceived information need (PIN), 
and real information need (RIN) are considered as central. 
The query is a formalization of a request which, in turn, 
is a natural language expression of a PIN. The PIN is the 
information need that a user subjectively perceives in the 
problematic situation. The RIN may e.g. be defined via 
the entirety of information that is 'objectively' relevant for 
the solution of the problem, thus extensionally defining the 
'problem' in 'problematic situation' through the RIN. 'Ob- 
jectively' may e.g. be determined by the intersection or 
union of the assessed RIN by the fictional set of all human 
experts for the problem. 

During the IR process the user then consumes or partly 
consumes the results, uses his assessment of relevance judg- 
ments, corrects his PIN, formulates a new query and so on, 
giving rise to a circular IR process (see e.g. [6]). A user 
will explore the space of information relevant to the RIN by 
repeated executions of the aforementioned IR cycle, itera- 
tively re-shaping his PIN, and enlarging the set of acquired 
information. 

Our notion of conscious information need corresponds to 
perceived information need (PIN) in [40] and our notion of 
unconscious information need encompasses the real informa- 



tion need (RIN) in [40]. In IR, the term unconscious infor- 
mation need is justified because the user is not consciously 
aware of information needs in RIN \ PIN (that are in RIN 
but not in PIN) in a 'problematic situation'. However, our 
notion of unconscious information need also encompasses an 
unspecific readiness to accept 'interesting' information. Un- 
less artificially defining some 'background problematic sit- 
uations', ongoing readiness to accept welcomed information 
that does not correspond to a 'problematic situation' (and
thus not to a RIN or PIN) is not represented in the schema 
of IR relevance. This case is simply not covered by the con- 
cept of information retrieval, where a problematic situation 
induces a concrete information need which in turn finally induces
a query. Examples of such a form of unconscious information
need are a user reading 'something interesting' on a news-feed
or being told 'something interesting' by a friend. Thus information may be delivered
to a user that the user has no a priori perceived information 
need for, and which the user has not explicitly asked for via 
a query or filter, but that he / she nevertheless judges as 'in- 
teresting'. This kind of information is usually pro-actively 
delivered by awareness services, or news services, or by di- 
rect communication services [14]. 

Context and especially social context may be used to pro- 
vide a relevance bracket for this 'interesting information' 
that is delivered to a user by such services by e.g. nar- 
rowing the visualizations of social network dynamics to the 
network neighborhood or spatio-temporal neighborhood of 
a user [17], using social filtering to deliver horizon broaden- 
ing recommendations [16], or using social contexts to specify 
suitable audiences for certain information [15]. 

The contextual relevance bracket is a means to anticipate 
or induce relevance via context in these proactive services 
[14]. Incorporation of context, especially of social and spatio-temporal
context, can be especially useful for information
retrieval e.g. by aiding the user in exploring the space of 
relevant information items / in expanding the PIN, espe- 
cially in relation to problems for which the RIN is hard to 
determine. This aid can be achieved by seeding the IR cycle 
with new motives especially beyond the PIN while providing 
a certain contextual bracket for relevance (in contrast to e.g. 
randomly choosing the seeds) as Figure 1 illustrates. In contrast
to well defined problems, which may exhibit a natural
saturation effect in view of new information, insights, competence
gains, or perspectives appearing after new IR cycles
and thus PIN ≈ RIN after 'sufficiently' many IR cycles, the
'problematic situations' for which the RIN is hard to determine
might not exhibit this saturation effect, either because
the problem's definition is not precise enough or because the
space of information items relevant to the RIN is very large.

Figure 1: Defining unconscious information need [14]

Traditional Context-Sensitive Information Retrieval is usu- 
ally focused on using types of context such as query histo- 
ries or implicit feedback on the results to a query (e.g. via 
click-analysis or eye-tracking) to improve relevance of the 
immediately retrieved results in view of a given query (see 
e.g. [47]). However, it is usually limited to the PIN expressed
in the query, because more general contextual brackets (e.g.
induced by social context) that would be able to deliver the
contextual seeds mentioned before are missing or disregarded.
For example, including seeds from the information spaces of
other competent people, who are identified by taking social
context into consideration in addition to the query, may improve
the exploration of the RIN, especially in those cases where
the boundaries of the RIN are hard to determine precisely.

2.2 Human IR and Wayfinding in Social Net- 
works 

If long-term social contexts in the form of social networks 
are used to provide contextual brackets for information re- 
trieval services in SN / MSN, it is important to review the 
basic results of decentralized routing and searching in these 
networks [33]. 

In 1967, Milgram's experiment [39] showed that decentralized
routing in social networks is possible and that the
path lengths involved are small. Watts and Strogatz [50]
were able to provide a network model for such Small World
networks, which explains not only their short mean average
path length but also their high clustering coefficient
(the network theoretic measure for triadic closure), a crucial
property of social networks. The Watts-Strogatz model
is based on a toroidally, regularly linked graph, where edges
are randomly redirected with a certain probability (shortcuts).
These constructive elements generate the local cluster
structure and the short mean average path length [50].
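
The construction described above can be illustrated with a minimal sketch. It assumes a one-dimensional ring as the regular graph, an even number k of nearest neighbours per node, and a single rewiring probability p; the function name `watts_strogatz` and its parameters are illustrative and not taken from [50].

```python
import random


def watts_strogatz(n, k, p, rng):
    """Ring of n nodes, each linked to its k nearest neighbours (k even).

    Every lattice edge is redirected to a random new endpoint with
    probability p, which creates the shortcuts. Returns an adjacency
    dict mapping node -> set of neighbours.
    """
    adj = {i: set() for i in range(n)}
    # Regular ring lattice: link each node to k/2 neighbours on each side.
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    # Rewire each lattice edge (i, i + j) with probability p.
    for j in range(1, k // 2 + 1):
        for i in range(n):
            old = (i + j) % n
            if old in adj[i] and rng.random() < p:
                # Avoid self-loops and duplicate edges.
                candidates = [v for v in range(n) if v != i and v not in adj[i]]
                if candidates:
                    new = rng.choice(candidates)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


graph = watts_strogatz(n=20, k=4, p=0.1, rng=random.Random(42))
```

With p = 0 the graph stays a regular lattice (high clustering, long paths); small positive values of p add a few shortcuts that shorten paths while most of the local cluster structure is kept.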

While such models were able to explain the basic structure 
of social networks, the actual explanation of the Milgram 
experiment, the question of how decentralized wayfinding or 
routing can actually be accomplished, was investigated by 
Kleinberg [29]. In his variant of the Small World model,
starting from a regularly linked network on a grid, the random
distant re-connections of a node a to a node b are
established with a probability proportional to $d(a,b)^{-\alpha}$,
where $d(a,b)$ is the grid distance between a and b. He was able to
show that for $\alpha$ equal to the dimension of the grid,
a decentralized (local knowledge only) routing algorithm that
always chooses the node located closest to the target node
as the next node is sufficient to produce 'sufficiently' short
expected delivery times, polynomial in $\log(n)$, where n
is the number of nodes in the network. Refinements of this
model in view of more realistic geographic distributions of
friendship relations on the earth's surface were investigated
by [34], suggesting a different geographic connection probability
distribution and empirically finding a different value
for $\alpha$, but confirming that the simple greedy local routing
algorithm still leads to efficient delivery. This confirms that for
efficient decentralized geographic routing in social networks, 
the nodes (actors) of the network need to be spatially embed- 
ded (e.g. have a known center of life) and each forwarding 



actor needs to have a cognitive model of this spatio-temporal 
context. 
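
The greedy local routing discussed above can be sketched as follows. The sketch assumes a non-wrapping n x n grid, Manhattan distance as the grid metric, and exactly one long-range contact per node; all function names are illustrative and the code is not the implementation used in [29] or [34].

```python
import random


def lattice_distance(a, b):
    """Manhattan distance between two grid nodes given as (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_long_range_links(n, alpha, rng):
    """Give every node of an n x n grid one long-range contact, drawn
    with probability proportional to d(a, b) ** -alpha."""
    nodes = [(r, c) for r in range(n) for c in range(n)]
    links = {}
    for a in nodes:
        others = [b for b in nodes if b != a]
        weights = [lattice_distance(a, b) ** -alpha for b in others]
        links[a] = rng.choices(others, weights=weights, k=1)[0]
    return links


def neighbours(node, n, links):
    """Grid neighbours inside the grid plus the node's long-range contact."""
    r, c = node
    grid = [(r + dr, c + dc)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < n and 0 <= c + dc < n]
    return grid + [links[node]]


def greedy_route(source, target, n, links):
    """Forward to the known contact closest to the target, using only
    local knowledge at each step."""
    path = [source]
    current = source
    while current != target:
        current = min(neighbours(current, n, links),
                      key=lambda v: lattice_distance(v, target))
        path.append(current)
    return path


links = build_long_range_links(n=8, alpha=2, rng=random.Random(0))
route = greedy_route((0, 0), (7, 7), 8, links)
```

Because a grid neighbour that is one step closer to the target always exists, the distance to the target decreases strictly in every step, so the route never needs more hops than the plain grid distance; the long-range contacts can only shorten it.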

More generally, besides spatial proximity other types of 
contextual metrics such as other long-term social contexts 
(e.g. occupation or hobbies) may as well be chosen to select 
the next node. The greedy local social search will select as 
the next node the node closest to the target node according 
to the given metrics (see e.g. [33]). 

Parallels exist between using general context information 
for decentralized routing and the way social information re- 
trieval is accomplished in human societies, which in turn has 
obvious commonalities with SN / MSN. In 'Human IR', a
question formalizing a PIN is 'routed' to persons who presumably
possess the required information in their (not
necessarily properly explicated) information spaces. The resulting
routes need to be 'socially resilient' enough (e.g. in
the sense of Granovetter's strong ties [12]) to support the ac- 
tors en-route agreeing to process the query and to support 
routing the retrieved information back to the questioner. At 
the same time the routes must contain enough weak ties (in 
Granovetter's sense) to convey new information or provide 
access to otherwise hardly reachable parts of the network 
via weak tie shortcuts in the sense of [50] [11]. 

As reviewed in [52], human information seeking behavior
often uses context, e.g. social context, to determine actors
who could be asked, especially if the problem situation and
the PIN are poorly defined. Actors facing an informational
problem will, besides the PIN (= WHAT), evaluate all
types of contexts, their interrelations and their relations to 
the PIN, in order to render their PIN more precise, expand 
their PIN towards the RIN and ultimately collect enough 
information to solve their problem (see [18] for a more elabo- 
rate discussion). For the discussion, types of contexts will be 
represented by other interrogative pronouns such as WHO 
(pointing to social context), WHERE and WHEN (pointing 
to spatio-temporal context). Vice versa, the asked persons 
may also use contextual knowledge to select appropriate in- 
formation for the questioner, which may also include infor- 
mation that is not strictly relevant to the query but relevant 
to the PIN or even RIN of the questioner. Thus relevance 
may also be induced by the asked actor via contextual knowledge.
As an example consider the question "How do I search
for certain terms while I browse a text-document with UNIX
'more'?". As an expert, a person might answer "Use the '/'
character and enter the term". As an expert and friend, the
person may add "Besides: use 'less' instead of 'more'!
It has a number of advantages". As an expert and close
friend, the person may add "Besides: I advise you
to quit using UNIX! A Mac will suit your needs and the
needs of your wife much better. It provides more comfortable
means to view and search text-files while still retaining
'less' and 'more' if desired", using social context and the
questioner's individual context.

In terms of long-term social context, Human IR 'uses' the
main characteristics of small world networks to search in 
the complex network of distributed information spaces and 
context-elements for the right information: actors are able 
to reach experts (and their information spaces) via short 
expected path lengths and the highly clustered structure 
ensures that each actor has a large number of routing op- 
tions. Suitable interdependent contextual metrics (Seman- 
tic (WHAT), social (WHO) or spatio-temporal (WHERE + 
WHEN)) allow efficiently navigating the space. 



3. SPATIO-TEMPORAL SMALL WORLDS 
FOR IR IN SOCIAL NETWORKING 

The question now arises, how we can employ these con- 
siderations and the considerations of the preceding section 
to construct an alternative information retrieval service for 
SN / MSN. While the complex socio-psychological mechanics
of amalgamating and evaluating the interdependencies
of WHO ↔ WHERE ↔ WHAT ↔ WHEN in Human IR
in view of searching the distributed information spaces in 
a context sensitive way are too intricate to model directly, 
spatio-temporal embedding may act as a reference point and 
a means to naturally encode these interdependencies be- 
tween the various forms of context for a respective IT model. 

3.1 Spatio-Temporal Small Worlds 

A social spatio-temporal small world may be defined as a 
social network, where the actor-nodes are spatio-temporally
embedded into space-time via their current center of life 
(compare previous section). The relations correspond to di- 
rected long-term social relations of various types. We have 
seen that spatial distance metrics (and via using a current 
time-frame thus also spatio-temporal distance metrics) al- 
low efficient decentralized routing. We assume that spatio- 
temporal distance metrics can thus also serve as one key 
means for a successful search for information in the social 
spatio-temporal small world part of the complex network 
of distributed information spaces and context-elements de- 
scribed in the previous section. 'Successful' implies that the 
information found is relevant in view of a user's RIN espe- 
cially in those cases where the RIN is hard to demarcate 
(see discussion in subsection 2.1). Another argument for 
using spatio-temporal distance metrics as a means to nat- 
urally encode interdependencies between the various forms 
of context or other metrics is that the studies of Kleinberg 
[30] and Liben-Nowell [34] imply that in a social spatio- 
temporal small world, spatio-temporal closeness is proba- 
bilistically correlated with social closeness. 

A semantic spatio-temporal small world may be defined 
as a network of information items (e.g. documents) that are 
spatio-temporally embedded into space-time via certain criteria.
Either the information item's meta-data contains an explicit
spatio-temporal embedding, or an implicit spatio-temporal
embedding is explicated, e.g. spatially via geo-parsing (see e.g.
[32] [28]) and geo-coding (see e.g. [28]) of the named
entities found (see e.g. [41]). In a third case, the information
item is embedded at the same spatio-temporal
location(s) as the actor whose information space
this item is associated with.

The first mode of edges of this network are the links indi- 
cating semantic relatedness of the items (e.g. HTTP links). 
The corresponding network has small world properties [26]. 
The second mode of edges relates items, whose 'owners' are 
linked in the social spatio-temporal small world, which also 
gives rise to a network with small world properties. 

As previously discussed, social closeness is probabilisti- 
cally correlated with spatial (and implicitly spatio-temporal) 
closeness [33] [46]. 

Studies by Brent Hecht [22], [23], [21], [24] and others (e.g. 
[35]) imply that in a semantic spatio-temporal small world, 
spatio-temporal closeness is probabilistically correlated with 
semantic closeness to a certain extent, which is also expressed
as a statistical tendency in (so-called) Tobler's first
law of Geography: "everything is related to everything else, 



but near things are more related than distant things" [49]. 

Figure 2: Spatio-temporal embedding of small
worlds: how spatio-temporal embedding maintains
social and semantic closeness properties as a statistical
tendency [14] [18].

Figure 2 visualizes social and semantic spatio-temporal 
small worlds and illustrates the maintenance of social and 
semantic closeness via spatio-temporal embedding. 

Social closeness is also probabilistically correlated with semantic
closeness. Homophily (the tendency of similar people
to associate with each other, contributing to triadic closure)
[38] and Peer Influence (the influence of persons who
are directly linked in the social network) [10] are prominent
explanations for the local homogeneity in terms of information
spaces of social groupings. The correlation between
social closeness and semantic / topical closeness is also sup- 
ported by other studies such as [5] and indirectly by [16]. 

Thus in view of decentralized search of relevant infor- 
mation in the complex network of distributed information 
spaces and context-elements which is characteristic of SN 
/ MSN, we assume that social spatio-temporal small worlds 
and semantic spatio-temporal small worlds may act as a sim- 
ple model of this complex network of context elements and 
spatio-temporal metrics may aid the decentralized search because
they implicitly represent interrelations between spatio-temporal,
social, and semantic relatedness.

Based on these considerations and models and the principles
of Human IR, the study [18] proposed a new context-aware,
agent-based, federated approach to information retrieval
in decentralized SN / MSN in order to investigate
limits and chances of using spatio-temporal embedding and 
its implicit 'conservation' of semantic and social context as 
a contextual bracket. Besides the discussion of the last sec- 
tions, the design decisions in this study were supported by 
a number of observations such as the ever growing availabil- 
ity of context in SN and especially MSN, the importance 
of the paradigm of Distributed Social Networking [53], the 
problems that the Hidden Web especially in connection with 
access protected SN / MSN information spaces generates for 
traditional search engines [20], or the obvious parallels that 
searching in SN / MSN has to Human IR. 

3.2 An Architecture based on Personal Infor- 
mation Agents 

The architecture of [18] is based on personal information 
agents associated with spatio-temporally embedded social actors
(users, companies, SN-platforms etc.), which contextually
decide upon the execution of another actor's query in
relation to the asked actor's information space. The agents 
are able to answer these queries in a context sensitive way, 
using techniques from Context-Sensitive IR and their exper- 
tise on their own information spaces. Each actor maintains 
socio-semantic links to other spatio-temporally embedded 
actors in form of topic specific expert-links, thus implement- 
ing a special form of social spatio-temporal small world. 
Furthermore, each actor publishes a selection of his / her 
expert-links and a set of spatio-temporally embedded exper- 
tises, summarizing content fields from the actor's informa- 
tion space (thus contributing to a special form of a semantic 
spatio-temporal small world). The spatio-temporal embedding
of expert-links and expertises ('knowledge flags') follows
the three step process discussed above. These knowledge
flags are published in a decentralized spatio-temporal
Peer-to-Peer index. If an actor issues a query which cannot 
be answered from his own information space, a social search 
is performed using the actor's expert links. If this search also 
fails, the spatial index is queried using the spatio-temporal 
embedding of the query, with the embedding following the 
three step process: e.g. if the query does not contain a 
spatio-temporal reference, the spatio-temporal reference of 
the questioner (see subsection 2.2) is used. The search de- 
livers a number of knowledge flags which the questioner's 
agent then further evaluates by asking the related other 
agents. The system thus combines elements of social search 
(via expert-links), semantic search (local IR-systems) and
spatio-temporal search (implying social and semantic con- 
texts to a certain extent as explained above). 
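
The order in which an agent consults its sources can be sketched as follows. This is a simplified illustration only, not the architecture of [18]: the classes `Agent` and `SpatialIndex`, the function `resolve`, the topic-keyed dictionaries and the rectangular radius test in decimal degrees are all assumptions made for the example, and access control, temporal context and forwarding over several hops are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    location: tuple  # (lat, lon) of the actor's center of life
    documents: dict = field(default_factory=dict)     # topic -> list of items
    expert_links: dict = field(default_factory=dict)  # topic -> list of Agent

    def answer_locally(self, topic):
        """Stand-in for the agent's local IR system."""
        return list(self.documents.get(topic, []))


class SpatialIndex:
    """Stand-in for the decentralized spatio-temporal P2P index."""

    def __init__(self):
        self.flags = []  # published knowledge flags: (location, topic, agent)

    def publish(self, agent, topic):
        self.flags.append((agent.location, topic, agent))

    def lookup(self, location, topic, radius):
        return [agent for loc, t, agent in self.flags
                if t == topic
                and abs(loc[0] - location[0]) <= radius
                and abs(loc[1] - location[1]) <= radius]


def resolve(agent, topic, index, query_location=None, radius=1.0):
    """Try the own information space, then expert links, then the index."""
    hits = agent.answer_locally(topic)
    if hits:
        return "local", hits
    # Social search via the actor's topic specific expert links.
    for expert in agent.expert_links.get(topic, []):
        hits = expert.answer_locally(topic)
        if hits:
            return "social", hits
    # Spatial search; a query without its own spatial reference
    # falls back to the location of the questioner.
    location = query_location if query_location is not None else agent.location
    for other in index.lookup(location, topic, radius):
        if other is agent:
            continue
        hits = other.answer_locally(topic)
        if hits:
            return "spatial", hits
    return "none", []


index = SpatialIndex()
alice = Agent("alice", (48.1, 11.6))
bob = Agent("bob", (48.2, 11.5), documents={"unix": ["use less"]})
index.publish(bob, "unix")
print(resolve(alice, "unix", index))
```

Each stage is only entered if the previous one returned nothing, which mirrors the fallback order described in the text: own information space first, social search second, spatio-temporal index last.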

Compared to e.g. Peer-to-Peer (P2P) IR systems, where
e.g. an index is distributed over the passively protocol-executing
peers in a P2P network (see e.g. [48] for a hybrid
document- / index-distribution approach), and which thus in
most cases basically 'merely' distribute a conventional IR
system over a P2P network, this architecture uses the local
IR systems of the actors' agents to locally decide upon relevance.
The agents are thus able to take into account the
(e.g. social) context of the query and the querying agent
/ its user, thus being able to optimize contextual relevance
and decide upon access (to control information flows, ensure
privacy or even employ information markets [15]). Furthermore,
they are able to pro-actively keep their published
knowledge flags up-to-date.

The small world structure of the networks involved ensures
that the expert-links, the comparatively coarse semantic
mapping of the agents' information spaces in form of the
expertises, and the comparatively coarse implicit conservation
of semantic and social contexts via spatio-temporal
embedding are sufficient to deliver enough contextual seeds to
reach enough competent agents. These agents can then either employ
their local IR systems to deliver contextually relevant
items or, resembling Human IR, use the private parts of their
expert link list to further forward the query if the questioner's
context matches (e.g. if the corresponding user is a friend).

4. STUDY 

4.1 Methodology 

Some elements of the architecture (such as the specially 
designed spatio-temporal P2P Quad-Tree) were evaluated



using a dataset based on spatially referenced Wikipedia articles,
demonstrating their practicability (see [19], [31]). Although
a full implementation and evaluation
scenario involving the necessary large number of actors and
sub-systems is not available, another evaluation step that can be taken is
to evaluate the suitability of the principles guiding the ar- 
chitecture's IR process inspired by Human IR, using social 
search, semantic search and spatio-temporal search and here 
especially the suitability of spatio-temporal embedding as a 
contextual bracket for this type of IR, implying social and 
semantic contexts to a certain extent as explained above. 
For this evaluation, a data-set is required that contains real 
association of users and information items as well as realis- 
tic locations of users and explicit spatio-temporal references 
of their information items, as well as a social network ex- 
hibiting characteristics of the expert-link network proposed 
in the architecture. The micro-blogging service Twitter [4] 
with its network of followers, significant share of mobile usage
and thus a large share of explicit spatial embeddings,
and the free availability of the data is a suitable evaluation
ground. We will now discuss some results of this evaluation. 



4.2 Dataset 

A dataset from Twitter was downloaded in June and July 
2010, using the Twitter API [4]. The Tweets and Re-Tweets 
which were non-English (which was decided using the ap- 
proach described in [9], employing an ML classifier using 
language specific n-gram statistics) were discarded. The remaining
(Re-)Tweets were Porter-stemmed [44] and stopwords
were removed. Of the Re-Tweets, only the additional
content without 're-citing' the original Tweet was regarded. 
An undirected social network between the users was induced
by establishing an edge if at least one @Reply or
@Mention [4] (roughly corresponding to a direct message)
was exchanged between the respective users. Of this so- 
cial network, the largest connected component was chosen, 
and the rest of the users and their Tweets and Re-Tweets 
discarded. We downloaded the complete information from 
43973129 Tweets and Re-Tweets, of which 9725514 were explicitly
geo-coded. 3323803 of these geo-coded entities were
associated with the largest connected component of our so- 
cial network and finally considered. Of the 6887632 users in 
the dataset, 670271 were explicitly geo-coded and 160690 of 
these belonged to the largest component of the social net- 
work that we considered. 

Users were spatially embedded via the geo-location of their
last available explicitly geo-located (Re-)Tweet. (Re-)Tweets
not explicitly spatially embedded (via geo-coordinates) were
embedded with a simple geo-parsing approach, analyzing
the strings denoting the location and subsequently using
the MetaCarta geo-coding service [3]. If this process failed,
the geo-locations of the Wikipedia articles corresponding to
the tags of the respective (Re-)Tweet were used, obtained via the
Wikipedia API (see previous section). If that failed, the location
of the authoring user was used. Locations were subjected
to very small random deviations (uniformly distributed in
[-0.1, 0.1] decimal degrees) to avoid mapping many
entities to the exact same location, which would result in
overcrowding peers with respect to the Quad-Tree based
spatio-temporal index which was used in the evaluation environment
for the experiments.
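
The fallback chain and the random deviation described above can be sketched as follows. The dictionary keys `coordinates`, `place` and `tags`, and the callables `geocode` and `tag_location` (stand-ins for the geo-coding service and the Wikipedia lookup), are hypothetical names chosen for this example; clamping the latitude is an added safeguard that the text does not mention.

```python
import random


def embed_tweet(tweet, user_location, geocode, tag_location):
    """Return (lat, lon, source) using the first step that succeeds."""
    # 1. Explicit geo-coordinates attached to the (Re-)Tweet.
    if tweet.get("coordinates") is not None:
        lat, lon = tweet["coordinates"]
        return lat, lon, "explicit"
    # 2. Geo-parsing / geo-coding of the location string.
    if tweet.get("place"):
        hit = geocode(tweet["place"])
        if hit is not None:
            return hit[0], hit[1], "geocoded"
    # 3. Location of an article that corresponds to one of the tags.
    for tag in tweet.get("tags", []):
        hit = tag_location(tag)
        if hit is not None:
            return hit[0], hit[1], "tag"
    # 4. Location of the authoring user.
    return user_location[0], user_location[1], "user"


def jitter(lat, lon, rng, max_dev=0.1):
    """Add a uniform random deviation of at most max_dev decimal degrees."""
    new_lat = lat + rng.uniform(-max_dev, max_dev)
    new_lon = lon + rng.uniform(-max_dev, max_dev)
    # Keep the latitude in its valid range (assumption, see lead-in).
    return max(-90.0, min(90.0, new_lat)), new_lon


gazetteer = {"Munich": (48.14, 11.58)}  # toy stand-in for a geo-coder
lat, lon, source = embed_tweet({"place": "Munich"}, (52.52, 13.40),
                               gazetteer.get, lambda tag: None)
lat, lon = jitter(lat, lon, random.Random(1))
```

Returning the source of the embedding alongside the coordinates makes it possible to check later how many items relied on each fallback step.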



4.3 Interrelations between Spatio Temporal, 
Social and Semantic Contexts 

The social network's mean average path length was 6.92
(a random graph with the same number of nodes, which was
computed with the help of the JUNG framework [43], yielded
a value of 8.96), and the average clustering coefficient [50] of
the social network was 0.046 (corresponding random graph:
0.000014). The average clustering coefficient on SN platforms
is usually higher by a factor of > 4 (e.g. [51] report an
average of 0.164 for their early 2009
crawl of several sub-networks of Facebook with an overall
number of nodes of ~$10^6$). Nevertheless, the numbers indicate that the
present network can still be regarded as having small world
properties.
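
For readers who want to reproduce the two small world indicators on a small graph, the following sketch computes them in plain Python. It is a stand-in for illustration and not the JUNG-based computation used in the study; it assumes an undirected graph given as an adjacency dictionary and averages path lengths over connected pairs only.

```python
from collections import deque


def average_clustering(adj):
    """Mean of the local clustering coefficients (nodes with fewer
    than two neighbours contribute 0)."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        nbr_list = list(nbrs)
        # Count the edges that exist among the neighbours of the node.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if nbr_list[j] in adj[nbr_list[i]])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def mean_path_length(adj):
    """Average shortest path length over all connected ordered pairs,
    computed with one breadth-first search per node."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Toy graph: a triangle a-b-c with a tail node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
clustering = average_clustering(toy)
path_length = mean_path_length(toy)
```

One breadth-first search per node is quadratic in the number of nodes, so for networks of the size discussed here a sampled set of source nodes or a dedicated graph library would be needed in practice.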

Figure 3 shows statistical properties of the dataset and 
correlation effects that support the mutual implication of 
social, semantic and spatial closeness, which represents a basis
for the proposed IR architecture. Sub-Figure 3(a) shows
the degree distribution of the social network, which roughly follows
a power law. This fact and the deviations from the exact
power law distribution coincide with the findings in [42]
[34]. Together with the previously discussed values for the
mean average path length and clustering coefficient, this shows
that the social network of actors in the data-set can indeed
be assumed to be a realistic small world social network.

Sub-Figure 3(b) shows the distribution of the number of
Tweets and Re-Tweets per user which, in our experiment,
simulate the information spaces of the users. While the
Re-Tweet distribution follows a power law, the distribution
of the number of Tweets shows deviations from the power-law
distribution. The $R^2$-value of fitting an exponential
function $y(x) = a e^{-bx}$ is significantly lower, which indicates that
a pure exponential fit is less appropriate. Functions of the
type $y(x) = \beta x^{-\alpha} + a e^{-bx}$, which induce an exponential cut-off
of the power-law's long tail, qualitatively show a better
congruence with the distribution and intuitively correspond
to the reasonable assumption that extremely large sizes of
information spaces of users in SN and MSN platforms are
very rare.
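
The comparison of a power-law fit with an exponential fit can be illustrated with a small sketch. It assumes that both functions are fitted by ordinary least squares after a logarithmic transformation and that the coefficient of determination is then evaluated on the untransformed values; the fitting procedure actually used in the study is not specified beyond 'standard regression', so this is one possible reading, and all function names are illustrative.

```python
import math


def linear_fit(us, vs):
    """Ordinary least squares for v = slope * u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
             / sum((u - mean_u) ** 2 for u in us))
    return slope, mean_v - slope * mean_u


def fit_power_law(xs, ys):
    """Fit y = beta * x ** -alpha via a line in log-log space."""
    slope, intercept = linear_fit([math.log(x) for x in xs],
                                  [math.log(y) for y in ys])
    return math.exp(intercept), -slope  # beta, alpha


def fit_exponential(xs, ys):
    """Fit y = a * exp(-b * x) via a line in semi-log space."""
    slope, intercept = linear_fit(xs, [math.log(y) for y in ys])
    return math.exp(intercept), -slope  # a, b


def r_squared(xs, ys, f):
    """Coefficient of determination of the model f on the data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data that follows an exact power law.
xs = list(range(1, 11))
ys = [3.0 * x ** -2 for x in xs]
beta, alpha = fit_power_law(xs, ys)
a, b = fit_exponential(xs, ys)
r2_power = r_squared(xs, ys, lambda x: beta * x ** -alpha)
r2_exp = r_squared(xs, ys, lambda x: a * math.exp(-b * x))
```

On data that really follow a power law, the power-law fit reaches a coefficient of determination close to 1 while the exponential fit stays below it, which is the kind of comparison the paragraph above refers to.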

Sub-Figure 3(c) shows the distribution of spatial (geodesic)
distance between adjacent nodes (actors with a direct social
relation) in the social network. Equivalence classes of
geodesic distances are determined in steps of 10 km. Due to
the spherical topology of earth's surface (with a maximum
circumference of roughly 40000 km at the Equator), the
maximum class of spatial distances encompasses all geodesic
distances between 19990 km and 20000 km. As reasonably
expected, the distribution shows that two users with a smaller
spatial distance have a higher probability of being socially
connected, where the distribution roughly follows a power
law. This confirms the results of other studies, such as [34], and
supports the assumption that social closeness and spatial
closeness mutually imply each other to a certain extent. As
the inset depicted in the left corner of the diagram shows,
the geographic distribution of the users concentrates on the
densely populated areas of North America and Europe. The
dip of the curve around ~5000 km may be explained by the
relative geometric dimensions of the Atlantic ocean and the
North American and European continents.
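
The distance classes used for this distribution can be sketched as follows. The sketch assumes a spherical earth with a mean radius of 6371 km and the haversine formula for the geodesic distance; the study does not state which formula was used, and the function names are illustrative. On this sphere the largest possible distance is about 20015 km, slightly above the rounded 20000 km mentioned in the text.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def geodesic_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # min() guards against rounding slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def distance_class(km, step_km=10):
    """Index of the equivalence class [i * step, (i + 1) * step)."""
    return int(km // step_km)


def edge_distance_histogram(edges, locations, step_km=10):
    """Count the socially connected pairs per distance class."""
    hist = {}
    for a, b in edges:
        km = geodesic_km(*locations[a], *locations[b])
        k = distance_class(km, step_km)
        hist[k] = hist.get(k, 0) + 1
    return hist


locations = {"u1": (48.137, 11.575), "u2": (47.377, 8.540), "u3": (48.2, 11.6)}
edges = [("u1", "u2"), ("u1", "u3")]
hist = edge_distance_histogram(edges, locations)
```

Dividing each class count by the number of all user pairs at that distance, rather than using raw counts, would be needed to read the histogram as a connection probability.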

Figure 3: General properties of the dataset and mutual implication of social, semantic and spatial closeness
using different measures (compare discussion in the text) [14]. Wherever a curve fit is provided (e.g. a power
law, linear or logarithmic function), standard regression [45] is used, where
$R^2 = 1 - \sum_i (y_i - f(x_i, \beta))^2 / \sum_i (y_i - \bar{y})^2$ is the coefficient of determination [14].

Sub-Figure 3(d) shows the correlation between the spatial
distance of pairs of users (this time counted in classes of 50
km steps) and the semantic similarity of their information
spaces (counted in equivalence classes of 1 %). The semantic
similarity of information spaces was computed as the
Tanimoto coefficient [36] of the multi-set of term-frequency
vectors of the respective sets of information items. An
alternative would have been, e.g., to use Rocchio centroids [27].
As an implementation, we used Lucene [1]. From a matrix
containing the absolute frequency of occurrences for each
combination of a geodesic distance class and a class of
semantic similarity of information spaces, we computed the
average absolute frequencies for the four new equivalence
classes [0, 100 km], [100, 1000 km], [1000, 5000 km], and
[5000, 20000 km]. The four qualitatively Gaussian curves show
that for larger distances the semantic similarity of the
information spaces of the users is smaller than for smaller
distances. This supports the connection between semantic
relatedness and geographic relatedness. Qualitatively similar
results have been obtained by [22], although the measures
used were different.
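
A Tanimoto coefficient on term-frequency vectors can be sketched in a few lines. The sketch assumes the extended (vector) form of the coefficient, $a \cdot b / (|a|^2 + |b|^2 - a \cdot b)$, and a naive whitespace tokenization; neither assumption is confirmed by the text, and the code does not use or imitate Lucene. All names are hypothetical.

```python
from collections import Counter

def term_frequencies(items):
    """Aggregate term counts over all information items of one user."""
    tf = Counter()
    for text in items:
        tf.update(text.lower().split())  # assumption: whitespace tokenization
    return tf

def tanimoto(tf_a, tf_b):
    """Extended Tanimoto coefficient of two sparse term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = sum(c * c for c in tf_a.values())
    norm_b = sum(c * c for c in tf_b.values())
    denom = norm_a + norm_b - dot
    return dot / denom if denom else 0.0

# Hypothetical information spaces of two users.
space_a = term_frequencies(["coffee in munich", "coffee"])
space_b = term_frequencies(["coffee zurich"])
similarity = tanimoto(space_a, space_b)
```

Identical vectors yield 1 and vectors without shared terms yield 0, so the values can be grouped directly into the 1 % equivalence classes.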

Sub-Figure 3(e) depicts a correlation between the semantic
similarity of information spaces of users (computed as in
sub-figure 3(d)) and their average path distance in the social
network. (Technically: from a matrix containing the absolute
frequency of occurrences for each combination of a class of
semantic similarities in [x, x + 1]% and a path distance
in the social network, we computed, for each class of
semantic similarities in [x, x + 1]%, the average over all path
distances up to 25.) The result shows that the more similar
the information spaces are, the smaller the average social
distance between the respective users is. This supports the
correlation between social closeness and semantic closeness.
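
The averaging step described in the technical remark can be sketched as a weighted mean over a frequency matrix. The nested-dictionary layout, the function name and the example counts are hypothetical assumptions made for illustration.

```python
def average_path_distance(counts):
    """Average social path distance per semantic similarity class.

    counts[s][d] is the number of user pairs whose semantic similarity
    falls into class s and whose path distance in the network is d.
    """
    averages = {}
    for sim_class, by_distance in counts.items():
        total = sum(by_distance.values())
        if total:  # skip empty similarity classes
            weighted = sum(d * n for d, n in by_distance.items())
            averages[sim_class] = weighted / total
    return averages

# Hypothetical frequency matrix: similarity class -> {path distance: count}.
counts = {0: {2: 1, 4: 3}, 5: {1: 2, 2: 2}, 7: {}}
averages = average_path_distance(counts)
```

In this toy matrix the higher similarity class has the smaller average path distance, which is the direction of the trend reported for the dataset.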

Sub-Figure 3(f) shows a correlation between the geographic
similarity of information spaces of users and their semantic
similarity. While semantic similarity was computed in the
same way as in 3(d) and 3(e), the geographic similarity of
information spaces of users was computed in the following
way: In order to compute a spatial relevance density for
the information space of a user, a point-like spatial
reference $\mu = (\mu_1, \mu_2)$ of an information item was
transformed into a Gaussian density contribution
$\mathcal{N}(\mu, \sigma)(x)$ with diagonal sigma corresponding
to a 500 km circle, cut off at $|x - \mu| = 500$ km, with the
help of ArcGIS [2]. All contributions (which properly
respected the spherical geometry of the earth's surface) were
added to yield a user $u_i$'s spatial relevance density
$\rho_i(x)$. The geographic similarity $\mathrm{sim}(u_i, u_j)$ of
the information spaces of two users $u_i$ and $u_j$ was computed
via a Jaccard-like measure:



$$\mathrm{sim}(u_i, u_j) = \frac{\int d^2x \, \min(\rho_i(x), \rho_j(x))}{\sqrt{\int d^2x \, \rho_i(x) \int d^2x \, \rho_j(x)}} \qquad (1)$$



Although most pairs of information spaces had a geographic
similarity of 0 (this large contribution was left out of the
diagram), the values show a trend: the larger the geographic
similarity of information spaces, the larger their semantic
similarity. Although the slope of this trend is rather small,
this finding supports the correlation between geographic
reference of information spaces and their semantic similarity.
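
The geographic similarity of Equation (1) can be approximated on a discrete grid. The sketch below rests on several assumptions that are not confirmed by the text: a planar grid in kilometres instead of the spherical geometry, an unnormalized isotropic Gaussian with a hypothetical `sigma_km`, and a normalization by the square root of the product of the two integrals. It does not reproduce the ArcGIS workflow used in the study.

```python
import math

def relevance_density(points, grid, sigma_km=250.0, cutoff_km=500.0):
    """Sum of truncated Gaussian contributions, sampled at the grid cells.

    The Gaussian prefactor is omitted; it cancels in the similarity
    as long as all users share the same sigma.
    """
    density = []
    for gx, gy in grid:
        value = 0.0
        for px, py in points:
            dist_sq = (gx - px) ** 2 + (gy - py) ** 2
            if dist_sq <= cutoff_km ** 2:  # cut off beyond 500 km
                value += math.exp(-dist_sq / (2 * sigma_km ** 2))
        density.append(value)
    return density

def geographic_similarity(rho_i, rho_j):
    """Overlap of two densities, normalized by sqrt of their total masses.

    The constant cell area cancels between numerator and denominator.
    """
    overlap = sum(min(a, b) for a, b in zip(rho_i, rho_j))
    norm = math.sqrt(sum(rho_i) * sum(rho_j))
    return overlap / norm if norm else 0.0

# Hypothetical planar grid with 100 km spacing and three users.
grid = [(x, y) for x in range(0, 3001, 100) for y in range(0, 1001, 100)]
rho_1 = relevance_density([(500, 500)], grid)
rho_2 = relevance_density([(600, 500)], grid)
rho_3 = relevance_density([(2500, 500)], grid)
```

With this normalization the measure lies between 0 and 1; users whose spatial references are more than 1000 km apart have disjoint densities and thus a similarity of exactly 0, which is consistent with the large share of zero values mentioned above.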

Relating this geographic similarity of information spaces
to the spatial geodesic distance between users, as shown in
sub-figure 3(g), yields a logarithmic trend, supporting the
reasonable connection that spatial closeness of users also
implies similarity in the spatial references of their
information spaces.

Sub-Figure 3(h) relates the social similarity between users,
computed as the Jaccard index of the sets of friends of two
users, to the respective average semantic similarity of
information spaces (where the semantic similarity of
information spaces is computed as in sub-figures 3(d), 3(e),
and 3(f)). We see a power law relating the two quantities:
the more similar the friend-sets of two users are, the more
semantically similar their information spaces are, and vice
versa. This supports the connection between social and
semantic contexts.
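
The social similarity used here is the standard Jaccard index of two friend-sets. A minimal sketch with hypothetical user names:

```python
def jaccard(friends_a, friends_b):
    """Jaccard index |A & B| / |A | B| of two sets of friends."""
    union = friends_a | friends_b
    return len(friends_a & friends_b) / len(union) if union else 0.0

# Hypothetical friend-sets of two users.
friends = {
    "ann": {"bob", "cat", "dan"},
    "eve": {"cat", "dan", "fay"},
}
social_similarity = jaccard(friends["ann"], friends["eve"])
```

Two of the four distinct friends are shared in this example, so the social similarity is 0.5.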

These preliminary results provide a promising basis for
future research investigating the connections between social,
spatio-temporal and semantic contexts.

4.4 Information Retrieval Experiments 

The results just discussed show that the dataset can be 
viewed as a dataset realistically including and relating social, 
spatial and semantic elements. They support the basic find- 
ings of subsection 2.1 and subsection 2.2 and the grounds for 
the IR approach discussed in section 3. In order to evaluate 
the basic suitability of these connections for IR, IR
experiments were conducted with the dataset.

As queries, Tweets were used. In the absence of real
user assessments of relevance to serve as ground truth for
the experiments, two implicit assessments of relevance were
used instead. As a first assessment of relevance (I), the
Re-Tweets of the query Tweet were regarded as relevant. This
assessment is intended to represent relevance with respect
to the conscious information needs of users. As a second
assessment of relevance (II), all Tweets and Re-Tweets of
users following (see [4]) the author of the query Tweet were
regarded as relevant. This assessment is intended to
represent relevance with respect to the unconscious
information needs of users containing the contextual seeds
discussed in subsection 2.1 and subsection 2.2.

In order to compare semantic search, social search and
spatial search (excluding temporal aspects for reasons of
simplicity) as a contextual bracket implicitly relating
social and semantic contexts, seven types of retrieval
processes were tested on the dataset. For each type of
retrieval, the 50 best results (according to the IR model of
the respective type) are retrieved and analyzed with the first
(I) and second (II) 'ground truth' assessment of relevance by
computing the usual confusion matrix (TP, FP, TN, and FN)
and, from that, precision P and recall R [37]. If fewer than
50 items could be retrieved, either the missing ones are
padded with random items from the respective pre-filtering
(e.g. geographic or social) before computing the measures to
ensure comparability (variant A), or the measures are
computed as is (variant B).
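
The evaluation with and without padding can be sketched as follows. The sketch assumes that padding draws, without replacement, from the pre-filtered candidates that were not already retrieved; the function names, the item identifiers and the seed are hypothetical.

```python
import random

def precision_recall(retrieved, relevant):
    """Precision and recall of a result list against a relevant set."""
    true_positives = len(set(retrieved) & set(relevant))
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

def top_k(ranked, candidates, k=50, pad=False, rng=None):
    """Best k results; variant A (pad=True) fills up with random candidates."""
    result = list(ranked[:k])
    if pad and len(result) < k:
        rng = rng or random.Random(0)
        pool = [c for c in candidates if c not in result]
        result += rng.sample(pool, min(k - len(result), len(pool)))
    return result

# Hypothetical query: only three items pass the semantic filtering.
candidates = ["t%d" % i for i in range(1, 101)]  # pre-filtered items
ranked = ["t1", "t2", "t3"]                      # ranked matches
relevant = {"t1", "t2", "t9"}                    # ground truth

variant_a = top_k(ranked, candidates, pad=True, rng=random.Random(42))
variant_b = top_k(ranked, candidates, pad=False)
p_a, r_a = precision_recall(variant_a, relevant)
p_b, r_b = precision_recall(variant_b, relevant)
```

Padding can only add true positives by chance, so recall cannot decrease under variant A, while precision typically drops because the result list grows to 50 items.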

Type 1 [Sem]: semantic search (standard IR): Use Lucene
[1] to compute a global IR index (over all information items
of the dataset) and decide upon the 50 best matches to the
query Tweet using Lucene's ranking.

Type 2 [Soc]: social search (social pre-filtering and
subsequent semantic filtering): Retrieve all information items
authored by friends and friends of friends of the query
Tweet's author, compute a local IR index on these items and
decide upon the 50 best matches to the query Tweet using the
local index. This type of search is roughly associated with
the expert-link-based type of social search with subsequent
evaluation using a local IR system in the architecture.

Type 3 [Geo]: geographic search (geographic pre-filtering
and subsequent semantic filtering): Using our implementation
of our variant of a distributed Quad-Tree and an octagonal
query geometry centered around the query Tweet's spatial
point reference, with a 'radius' between 500 km and 20 km
depending on the depth of the tree in this region
(corresponding to the density of information items), the
spatially matching items were retrieved. From this set of
items, the 50 semantically best matches were determined as in
the case of social search. This type of search is roughly
associated with the spatio-temporal search of the architecture
on Expertises with a subsequent employment of local IR.

Type 4 [Soc∪Geo]: social-geographic search ∪ (using the
union X ∪ Y of the results of geographic pre-filtering X and
social pre-filtering Y and subsequent semantic filtering with
Lucene as in types 2 and 3). This type is roughly associated
with the spatio-temporal search of the architecture on all
knowledge flags (Expertises and Expert-Links) with subsequent
local IR.

Type 5 [Soc∩Geo]: social-geographic search ∩ (using the
intersection X ∩ Y of the results of geographic pre-filtering
X and social pre-filtering Y and subsequent semantic
filtering). This type of search is performed for reference
purposes.

Type 6 [RndGeo]: random pre-filtering geographic (randomly
select as many items from the dataset as a geographic
pre-filtering would deliver and perform subsequent semantic
filtering). This type of search is performed for reference
purposes to further investigate the impact of geographic
pre-filtering and thus the role of spatial context as a
contextual bracket.

Type 7 [RndSoc]: random pre-filtering social (randomly 
select as many items from the dataset as a social pre-filtering 
would deliver and perform subsequent semantic filtering). 
This type of search is performed for reference purposes to 
further investigate the impact of social pre-filtering. 
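
The candidate sets that the retrieval types hand to the subsequent semantic filtering can be sketched with plain set operations. The sketch treats the result of the geographic pre-filtering as a given input, excludes the query author's own items (an assumption), and uses hypothetical data structures and names; the Quad-Tree and the Lucene ranking are not modelled.

```python
import random

def social_candidates(author, friends, items_by_author):
    """Items authored by friends and friends of friends of `author`."""
    circle = set(friends.get(author, ()))
    for friend in list(circle):
        circle |= set(friends.get(friend, ()))
    circle.discard(author)  # assumption: the author's own items are excluded
    return {item for user in circle for item in items_by_author.get(user, ())}

def random_candidates(all_items, size, rng):
    """Random pre-filtering with the size of a contextual pre-filtering."""
    return set(rng.sample(sorted(all_items), min(size, len(all_items))))

def candidate_sets(author, friends, items_by_author, geo_items, rng):
    """Pre-filtered item sets for the contextual and reference types."""
    all_items = {i for items in items_by_author.values() for i in items}
    soc = social_candidates(author, friends, items_by_author)
    geo = set(geo_items)
    return {
        "soc": soc,                        # type 2
        "geo": geo,                        # type 3
        "union": soc | geo,                # type 4
        "intersection": soc & geo,         # type 5
        "rnd_geo": random_candidates(all_items, len(geo), rng),  # type 6
        "rnd_soc": random_candidates(all_items, len(soc), rng),  # type 7
    }

# Hypothetical toy network.
friends = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"}, "dan": set()}
items_by_author = {"ann": ["a1"], "bob": ["b1", "b2"], "cat": ["c1"], "dan": ["d1"]}
sets = candidate_sets("ann", friends, items_by_author, {"b1", "d1"}, random.Random(1))
```

Each set would then be indexed locally and ranked against the query Tweet, so that the contextual types and their random counterparts differ only in how the candidates were chosen.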

Figure 4: Precision and Recall, variant A [14]

Figure 4 shows the precision and recall values of variant
A. For I, the first ground-truth relevance assessment, the
conventional purely semantic search performs best by far (in
precision as well as recall). For II, the second ground-truth
relevance assessment, social search is most successful and
geographic search is still comparable to semantic search. If
the assumption that II corresponds to contributing to
satisfying unconscious information needs via contextual
seeds is indeed sound, this result supports the proposed IR
approach. In view of the role of spatial context as a
contextual bracket implying semantic context to a certain
degree, the comparison of geographic search (Geo) with random
pre-filtering geographic (RndGeo) shows that Geo is indeed
significantly better than RndGeo. In other words, while Sem
may use the whole set of information items to choose the 50
best (via the global index), Geo must choose from the
considerably smaller set resulting from geographic
pre-filtering and still delivers acceptable performance
relative to a random pre-filtering. Indexing the whole set of
information items may not be desirable for SN and MSN
environments due to privacy considerations. Because of the
connections between geographic closeness and social
closeness, we can thus, in a realistic SN and MSN setting,
expect that Geo may effectively draw from a locally richer
set of relevant items and thus deliver even better overall
performance than Sem.

Figure 5: Precision and Recall, variant B [14]

Figure 5 shows the precision and recall values of variant
B, where an insufficient number of retrieved items (due to
the restrictive pre-filtering) is not padded with random items
(such padding induces a pessimistic evaluation for the
contextual search variants). Here, as a consequence, social
search is best also for assessment I with respect to
precision.



5. CONCLUSION AND OPPORTUNITIES 
FOR FUTURE RESEARCH 

Our overall results may be interpreted as giving support
to exploiting the concept of Spatio-Temporal Small Worlds
and the underlying correlations between semantic,
spatio-temporal, and social contexts for alternative IR, akin
to Human IR, in (Decentralized) Social Networking.

However, the evaluation environment may still not take
advantage of several of the benefits of the architecture (such
as the power of local agent IR systems). Thus, one might
expect that the approach is indeed able to deliver useful
contextual seeds, especially in view of unconscious
information needs, and thus is a new alternative IR concept
for SN and MSN environments.

Nevertheless, the introduced study is only a starting point 
for a large body of future work on connecting social, seman- 
tic and spatio-temporal contexts for new and useful forms 
of IR. 

As has been mentioned above, a full implementation and 
real world evaluation of the architecture would be the next 
step following the usual Design Science methodology [25]. A
special focus has to be put on evaluating the usefulness of the 
results obtained by the suggested alternative IR methods in 
terms of the extended notions of information need discussed 
above. Suitable concepts of extended versions of precision 
and recall will have to be constructed for the respective eval- 
uations. Furthermore, more variants of combining spatial, 
social and semantic retrieval criteria need to be evaluated 
in relation to the individual and social short term and long 
term context of the querying user. 